Kernel Logistic Regression
Roadmap
1. Embedding Numerous Features: Kernel Models


Lecture 4: Soft-Margin Support Vector Machine 


allow some margin violations $\xi_n$ while penalizing
them by $C$; equivalent to upper-bounding $\alpha_n$ by $C$









Lecture 5: Kernel Logistic Regression 


• Soft-Margin SVM as Regularized Model
• SVM versus Logistic Regression
• SVM for Soft Binary Classification
• Kernel Logistic Regression





2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

Kernel Logistic Regression Soft-Margin SVM as Regularized Model


Wrap-Up 


Hard-Margin Primal 


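Soft-Margin Primal

$$\min_{b,\mathbf{w},\boldsymbol{\xi}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n$$
$$\text{s.t. } y_n(\mathbf{w}^T\mathbf{z}_n + b) \ge 1 - \xi_n,\ \xi_n \ge 0$$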


Soft-Margin Dual





soft-margin preferred in practice; 
linear: LIBLINEAR; non-linear: LIBSVM 
Slack Variables $\xi_n$

• record 'margin violation' by $\xi_n$
• penalize with margin violation

$$\min_{b,\mathbf{w},\boldsymbol{\xi}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\xi_n$$
$$\text{s.t. } y_n(\mathbf{w}^T\mathbf{z}_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$

on any $(b, \mathbf{w})$, $\xi_n$ = margin violation = $\max(1 - y_n(\mathbf{w}^T\mathbf{z}_n + b),\, 0)$
• $(\mathbf{x}_n, y_n)$ violating margin: $\xi_n = 1 - y_n(\mathbf{w}^T\mathbf{z}_n + b)$
• $(\mathbf{x}_n, y_n)$ not violating margin: $\xi_n = 0$

'unconstrained' form of soft-margin SVM:

$$\min_{b,\mathbf{w}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\max\left(1 - y_n(\mathbf{w}^T\mathbf{z}_n + b),\, 0\right)$$
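The margin violation and the unconstrained objective can be evaluated directly for any candidate (b, w). The following is a minimal sketch in plain Python; the function names and the toy data are made up for illustration and are not part of the slides.

```python
def margin_violations(w, b, Z, y):
    """xi_n = max(1 - y_n (w^T z_n + b), 0) for every example."""
    xis = []
    for z_n, y_n in zip(Z, y):
        score = sum(w_i * z_i for w_i, z_i in zip(w, z_n)) + b
        xis.append(max(1.0 - y_n * score, 0.0))
    return xis


def soft_margin_objective(w, b, Z, y, C):
    """(1/2) w^T w + C * sum_n max(1 - y_n (w^T z_n + b), 0)."""
    regularizer = 0.5 * sum(w_i * w_i for w_i in w)
    return regularizer + C * sum(margin_violations(w, b, Z, y))


# Toy data (hypothetical): two examples respect the margin, two violate it.
Z = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.2, 0.0]]
y = [+1, +1, -1, -1]
w, b, C = [1.0, 0.0], 0.0, 10.0

xi = margin_violations(w, b, Z, y)
objective = soft_margin_objective(w, b, Z, y, C)
```

Examples with y_n(w^T z_n + b) >= 1 contribute zero violation, so only the second and fourth toy examples add to the penalty term.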





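Unconstrained Form

$$\min_{b,\mathbf{w}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\max\left(1 - y_n(\mathbf{w}^T\mathbf{z}_n + b),\, 0\right)$$

familiar? :-) just L2 regularization

$$\min \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum \widehat{\text{err}} \qquad \text{versus} \qquad \min \ \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum \text{err}$$

with shorter $\mathbf{w}$, another parameter, and special err

why not solve this? :-)
• not QP, no (?) kernel trick
• $\max(\cdot, 0)$ not differentiable, harder to solve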


Hsuan-Tien Lin (NTU CSIE)





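SVM as Regularized Model

|                              | minimize                                                         | constraint                       |
|------------------------------|------------------------------------------------------------------|----------------------------------|
| regularization by constraint | $E_{\text{in}}$                                                  | $\mathbf{w}^T\mathbf{w} \le C$   |
| hard-margin SVM              | $\mathbf{w}^T\mathbf{w}$                                         | $E_{\text{in}} = 0$ [and more]   |
| L2 regularization            | $\frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + E_{\text{in}}$        |                                  |
| soft-margin SVM              | $\frac{1}{2}\mathbf{w}^T\mathbf{w} + CN\widehat{E}_{\text{in}}$  |                                  |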











large margin ⟺ fewer hyperplanes ⟺ L2 regularization of short $\mathbf{w}$
soft margin ⟺ special $\widehat{\text{err}}$
larger $C$ (in either formulation) ⟺ smaller $\lambda$ ⟺ less regularization


viewing SVM as regularized model: 
allows extending/connecting to other learning models 
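The link between $C$ and $\lambda$ can be made explicit by rescaling the soft-margin objective. The short derivation below is a sketch that ignores the fact that SVM leaves $b$ unregularized; $\widehat{\text{err}}_n$ abbreviates the hinge term of example $n$.

```latex
\begin{align*}
\frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\widehat{\text{err}}_n
  &= CN\left(\frac{1}{2CN}\,\mathbf{w}^T\mathbf{w}
     + \frac{1}{N}\sum_{n=1}^{N}\widehat{\text{err}}_n\right) \\
\frac{\lambda}{N} = \frac{1}{2CN}
  &\iff \lambda = \frac{1}{2C}
  \iff C = \frac{1}{2\lambda}
\end{align*}
```

Multiplying an objective by the positive constant $CN$ does not change its minimizer, so the two problems share the same solution when $\lambda = 1/(2C)$.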
Fun Time 


When viewing soft-margin SVM as regularized model, a larger $C$
corresponds to

1. a larger $\lambda$, that is, stronger regularization
2. a smaller $\lambda$, that is, stronger regularization
3. a larger $\lambda$, that is, weaker regularization
4. a smaller $\lambda$, that is, weaker regularization

Reference Answer: (4)

Comparing the two formulations on the 'Unconstrained Form' slide,
we see that $C$ corresponds to $\frac{1}{2\lambda}$. So
larger $C$ corresponds to smaller $\lambda$, which
surely means weaker regularization.
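Kernel Logistic Regression SVM versus Logistic Regression

Algorithmic Error Measure of SVM

$$\min_{b,\mathbf{w}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\max\left(1 - y_n(\mathbf{w}^T\mathbf{z}_n + b),\, 0\right)$$

linear score $s = \mathbf{w}^T\mathbf{z}_n + b$
• $\text{err}_{0/1}(s, y) = [ys < 0]$
• $\widehat{\text{err}}_{\text{SVM}}(s, y) = \max(1 - ys,\, 0)$: upper bound of $\text{err}_{0/1}$
  (often called hinge error measure)

[Figure: $\text{err}_{0/1}$ and $\widehat{\text{err}}_{\text{SVM}}$ plotted against $ys$]

$\widehat{\text{err}}_{\text{SVM}}$: algorithmic error measure
by convex upper bound of $\text{err}_{0/1}$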

Connection between SVM and Logistic Regression 











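linear score $s = \mathbf{w}^T\mathbf{z}_n + b$
• $\text{err}_{0/1}(s, y) = [ys < 0]$
• $\widehat{\text{err}}_{\text{SVM}}(s, y) = \max(1 - ys,\, 0)$: upper bound of $\text{err}_{0/1}$
• $\text{err}_{\text{SCE}}(s, y) = \log_2(1 + \exp(-ys))$: another upper bound of $\text{err}_{0/1}$, used in logistic regression

[Figure: $\text{err}_{0/1}$, $\widehat{\text{err}}_{\text{SVM}}$, and scaled cross-entropy $\text{err}_{\text{SCE}}$ plotted against $ys$]

|                                              | $ys \to -\infty$ | $ys \to +\infty$ |
|----------------------------------------------|------------------|------------------|
| $\widehat{\text{err}}_{\text{SVM}}(s, y)$    | $\approx -ys$    | $= 0$            |
| $(\ln 2) \cdot \text{err}_{\text{SCE}}(s, y)$ | $\approx -ys$    | $\approx 0$      |

SVM ≈ L2-regularized logistic regression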
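The three error measures are easy to compare numerically. The sketch below uses only the formulas from this slide; the grid of `ys` values and the variable names are arbitrary choices for illustration.

```python
import math


def err_01(ys):
    """0/1 error: 1 when the signed score ys is negative."""
    return 1.0 if ys < 0 else 0.0


def err_svm(ys):
    """Hinge error max(1 - ys, 0)."""
    return max(1.0 - ys, 0.0)


def err_sce(ys):
    """Scaled cross-entropy error log2(1 + exp(-ys))."""
    return math.log2(1.0 + math.exp(-ys))


grid = [k / 10.0 for k in range(-50, 51)]  # ys from -5.0 to 5.0

# Both measures upper-bound the 0/1 error everywhere on the grid.
hinge_is_upper_bound = all(err_svm(v) >= err_01(v) for v in grid)
sce_is_upper_bound = all(err_sce(v) >= err_01(v) for v in grid)

# The hinge bound is tight exactly where ys >= 1 (both errors are 0 there).
tight_points = [v for v in grid if err_svm(v) == err_01(v)]

# For very negative ys, hinge and (ln 2) * scaled cross-entropy both behave like -ys.
hinge_far_left = err_svm(-20.0)
sce_far_left = math.log(2.0) * err_sce(-20.0)
```

On the far left the two values differ by about 1 (the hinge offset), which is negligible relative to -ys as ys goes to minus infinity.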


Linear Models for Binary Classification

PLA
• minimize $\text{err}_{0/1}$ specially
• pros: efficient if lin. separable
• cons: works only if lin. separable, otherwise needing pocket

soft-margin SVM
• minimize regularized $\widehat{\text{err}}_{\text{SVM}}$ by QP
• pros: 'easy' optimization & theoretical guarantee
• cons: loose bound of $\text{err}_{0/1}$ for very negative $ys$

regularized logistic regression for classification
• minimize regularized $\text{err}_{\text{SCE}}$ by GD/SGD/...
• pros: 'easy' optimization & regularization guard
• cons: loose bound of $\text{err}_{0/1}$ for very negative $ys$

regularized LogReg ⟹ approximate SVM
SVM ⟹ approximate LogReg (?)
Fun Time 


We know that $\widehat{\text{err}}_{\text{SVM}}(s, y)$ is an upper bound of $\text{err}_{0/1}(s, y)$. When is
the upper bound tight? That is, when is $\widehat{\text{err}}_{\text{SVM}}(s, y) = \text{err}_{0/1}(s, y)$?

1. $ys \ge 0$
2. $ys \le 0$
3. $ys \ge 1$
4. $ys \le 1$

Reference Answer: (3)

By plotting the figure, we can easily see that
$\widehat{\text{err}}_{\text{SVM}}(s, y) = \text{err}_{0/1}(s, y)$ if and only if $ys \ge 1$.
In that case, both error functions evaluate to 0.
Kernel Logistic Regression SVM for Soft Binary Classification

SVM for Soft Binary Classification

Naïve Idea 1
1. run SVM and get $(b_{\text{SVM}}, \mathbf{w}_{\text{SVM}})$
2. return $g(\mathbf{x}) = \theta(\mathbf{w}_{\text{SVM}}^T\mathbf{x} + b_{\text{SVM}})$

• 'direct' use of similarity: works reasonably well
• no LogReg flavor

Naïve Idea 2
1. run SVM and get $(b_{\text{SVM}}, \mathbf{w}_{\text{SVM}})$
2. run LogReg with $(b_{\text{SVM}}, \mathbf{w}_{\text{SVM}})$ as $\mathbf{w}_0$
3. return LogReg solution as $g(\mathbf{x})$

• not really 'easier' than original LogReg
• SVM flavor (kernel?) lost

want: flavors from both sides
A Possible Model: Two-Level Learning

$$g(\mathbf{x}) = \theta\left(A \cdot (\mathbf{w}_{\text{SVM}}^T\boldsymbol{\Phi}(\mathbf{x}) + b_{\text{SVM}}) + B\right)$$

• SVM flavor: fix hyperplane direction by $\mathbf{w}_{\text{SVM}}$, so kernel applies
• LogReg flavor: fine-tune hyperplane to match maximum likelihood by scaling ($A$) and shifting ($B$)
  • often $A > 0$ if $\mathbf{w}_{\text{SVM}}$ reasonably good
  • often $B \approx 0$ if $b_{\text{SVM}}$ reasonably good

new LogReg Problem:

$$\min_{A,B} \ \frac{1}{N}\sum_{n=1}^{N}\log\Big(1 + \exp\big(-y_n(A \cdot \underbrace{(\mathbf{w}_{\text{SVM}}^T\boldsymbol{\Phi}(\mathbf{x}_n) + b_{\text{SVM}})}_{\boldsymbol{\Phi}_{\text{SVM}}(\mathbf{x}_n)} + B)\big)\Big)$$

two-level learning:
LogReg on SVM-transformed data


Probabilistic SVM

Platt's Model of Probabilistic SVM for Soft Binary Classification
1. run SVM on $\mathcal{D}$ to get $(b_{\text{SVM}}, \mathbf{w}_{\text{SVM}})$ [or the equivalent $\boldsymbol{\alpha}$], and
   transform $\mathcal{D}$ to $z'_n = \mathbf{w}_{\text{SVM}}^T\boldsymbol{\Phi}(\mathbf{x}_n) + b_{\text{SVM}}$
   (actual model performs this step in a more complicated manner)
2. run LogReg on $\{(z'_n, y_n)\}_{n=1}^{N}$ to get $(A, B)$
   (actual model adds some special regularization here)
3. return $g(\mathbf{x}) = \theta\left(A \cdot (\mathbf{w}_{\text{SVM}}^T\boldsymbol{\Phi}(\mathbf{x}) + b_{\text{SVM}}) + B\right)$

• soft binary classifier not having the same boundary as SVM classifier, because of $B$
• how to solve LogReg: GD/SGD/or better, because only two variables

kernel SVM ⟹ approx. LogReg in $\mathcal{Z}$-space
exact LogReg in $\mathcal{Z}$-space?
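The second level of the two-level model is a logistic regression with only two variables. The sketch below fits (A, B) by plain gradient descent on precomputed scores; it is a simplified illustration, not Platt's actual procedure (it omits the special regularization and the more involved score computation mentioned above). The scores and labels are hypothetical stand-ins for the output of a trained SVM.

```python
import math


def theta(s):
    """Logistic function 1 / (1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def mean_log_loss(A, B, scores, labels):
    """(1/N) sum_n log(1 + exp(-y_n (A z'_n + B)))."""
    total = sum(-math.log(theta(y * (A * z + B))) for z, y in zip(scores, labels))
    return total / len(scores)


def fit_scale_shift(scores, labels, eta=0.1, steps=5000):
    """Gradient descent over the two variables (A, B)."""
    A, B = 1.0, 0.0
    N = len(scores)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for z, y in zip(scores, labels):
            # d/ds log(1 + exp(-y s)) = -y * theta(-y s), with s = A z + B
            g = -y * theta(-y * (A * z + B))
            grad_A += g * z / N
            grad_B += g / N
        A -= eta * grad_A
        B -= eta * grad_B
    return A, B


# Hypothetical SVM scores z'_n and labels y_n (not linearly separable in z').
scores = [2.1, 1.3, 0.4, -0.2, 0.3, -0.9, -1.6, -2.4]
labels = [+1, +1, +1, +1, -1, -1, -1, -1]

A, B = fit_scale_shift(scores, labels)


def soft_output(z):
    """Estimated P(+1 | x) for an SVM score z."""
    return theta(A * z + B)
```

Because the problem has only two variables and a convex objective, simple gradient descent is enough here; a second-order method would also be a reasonable choice.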
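Fun Time

Recall that the score $\mathbf{w}_{\text{SVM}}^T\boldsymbol{\Phi}(\mathbf{x}) + b_{\text{SVM}} = \sum_{\text{SV}} \alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + b_{\text{SVM}}$ for the
kernel SVM. When coupling the kernel SVM with $(A, B)$ to form a
probabilistic SVM, which of the following is the resulting $g(\mathbf{x})$?

1. $\theta\left(\sum_{\text{SV}} B\alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + b_{\text{SVM}}\right)$
2. $\theta\left(\sum_{\text{SV}} B\alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + Bb_{\text{SVM}} + A\right)$
3. $\theta\left(\sum_{\text{SV}} A\alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + b_{\text{SVM}}\right)$
4. $\theta\left(\sum_{\text{SV}} A\alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + Ab_{\text{SVM}} + B\right)$

Reference Answer: (4)

We can simply plug the kernel formula of the score into $g(\mathbf{x})$.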
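Plugging the kernel form of the score into g(x) can be sketched as follows. The Gaussian kernel, the support vectors, and the values of alpha_n, b_SVM, A, and B are all made-up placeholders; in practice they would come from a trained SVM and the second-level logistic regression.

```python
import math


def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x1 - x2||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))


def svm_score(x, support, b_svm, kernel=rbf_kernel):
    """sum over support vectors of alpha_n y_n K(x_n, x), plus b_SVM."""
    return sum(alpha * y * kernel(x_n, x) for alpha, y, x_n in support) + b_svm


def probabilistic_svm(x, support, b_svm, A, B, kernel=rbf_kernel):
    """g(x) = theta(A * (kernel SVM score) + B)."""
    s = A * svm_score(x, support, b_svm, kernel) + B
    return 1.0 / (1.0 + math.exp(-s))


# Hypothetical support vectors as (alpha_n, y_n, x_n) triples.
support = [(0.8, +1, [1.0, 1.0]), (0.5, -1, [-1.0, 0.0]), (0.3, -1, [0.0, -1.0])]
b_svm, A, B = 0.1, 2.0, -0.05

x_new = [0.5, 0.5]
p = probabilistic_svm(x_new, support, b_svm, A, B)

# Expanded form from option (4): sum_SV A alpha_n y_n K(x_n, x) + A b_SVM + B
expanded = sum(A * a * y * rbf_kernel(x_n, x_new) for a, y, x_n in support) + A * b_svm + B
p_expanded = 1.0 / (1.0 + math.exp(-expanded))
```

The two ways of computing the output agree, since distributing A over the score gives exactly the expression in option (4).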
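Kernel Logistic Regression Kernel Logistic Regression

Key behind Kernel Trick

one key behind kernel trick: optimal $\mathbf{w}_* = \sum_{n=1}^{N}\beta_n\mathbf{z}_n$

because $\mathbf{w}_*^T\mathbf{z} = \sum_{n=1}^{N}\beta_n\mathbf{z}_n^T\mathbf{z} = \sum_{n=1}^{N}\beta_n K(\mathbf{x}_n, \mathbf{x})$

| SVM | PLA | LogReg by SGD |
|-----|-----|---------------|
| $\mathbf{w}_{\text{SVM}} = \sum_{n=1}^{N}(\alpha_n y_n)\mathbf{z}_n$ | $\mathbf{w}_{\text{PLA}} = \sum_{n=1}^{N}(\alpha_n y_n)\mathbf{z}_n$ | $\mathbf{w}_{\text{LOGREG}} = \sum_{n=1}^{N}(\alpha_n y_n)\mathbf{z}_n$ |
| $\alpha_n$ from dual solutions | $\alpha_n$ by # mistake corrections | $\alpha_n$ by total SGD moves |

when can optimal $\mathbf{w}_*$ be represented by $\mathbf{z}_n$?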


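Representer Theorem

claim: for any L2-regularized linear model

$$\min_{\mathbf{w}} \ \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N}\text{err}(y_n, \mathbf{w}^T\mathbf{z}_n)$$

optimal $\mathbf{w}_* = \sum_{n=1}^{N}\beta_n\mathbf{z}_n$.

• let optimal $\mathbf{w}_* = \mathbf{w}_{\parallel} + \mathbf{w}_{\perp}$, where $\mathbf{w}_{\parallel} \in \text{span}(\mathbf{z}_n)$ & $\mathbf{w}_{\perp} \perp \text{span}(\mathbf{z}_n)$; want $\mathbf{w}_{\perp} = \mathbf{0}$
• what if not? Consider $\mathbf{w}_{\parallel}$
  • of same err as $\mathbf{w}_*$: $\text{err}(y_n, \mathbf{w}_*^T\mathbf{z}_n) = \text{err}(y_n, (\mathbf{w}_{\parallel} + \mathbf{w}_{\perp})^T\mathbf{z}_n) = \text{err}(y_n, \mathbf{w}_{\parallel}^T\mathbf{z}_n)$, because $\mathbf{w}_{\perp}^T\mathbf{z}_n = 0$
  • of smaller regularizer than $\mathbf{w}_*$: $\mathbf{w}_*^T\mathbf{w}_* = \mathbf{w}_{\parallel}^T\mathbf{w}_{\parallel} + 2\mathbf{w}_{\parallel}^T\mathbf{w}_{\perp} + \mathbf{w}_{\perp}^T\mathbf{w}_{\perp} > \mathbf{w}_{\parallel}^T\mathbf{w}_{\parallel}$
  • so $\mathbf{w}_{\parallel}$ 'more optimal' than $\mathbf{w}_*$ (contradiction!)

any L2-regularized linear model
can be kernelized!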
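The claim can be checked numerically on a tiny problem. The sketch below minimizes the L2-regularized logistic regression objective by gradient descent, deliberately starting from a w that has a component orthogonal to span(z_n), and then verifies that this component vanishes. The data, step size, and iteration count are arbitrary choices for illustration.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def theta(s):
    return 1.0 / (1.0 + math.exp(-s))


# Two examples in three dimensions, so span(z_n) is a plane.
Z = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [+1, -1]
lam, eta, N = 1.0, 0.1, len(Z)

# Unit vector orthogonal to both z_1 and z_2.
c = 1.0 / math.sqrt(3.0)
u = [c, c, -c]

w = [0.5, -0.5, 1.0]          # starting point with a nonzero orthogonal part
perp_before = dot(w, u)

for _ in range(2000):
    # gradient of (lam/N) w^T w + (1/N) sum_n log(1 + exp(-y_n w^T z_n))
    grad = [2.0 * lam / N * w_i for w_i in w]
    for z_n, y_n in zip(Z, y):
        g = -y_n * theta(-y_n * dot(w, z_n)) / N
        grad = [g_i + g * z_i for g_i, z_i in zip(grad, z_n)]
    w = [w_i - eta * g_i for w_i, g_i in zip(w, grad)]

perp_after = dot(w, u)

# Recover beta from the 2x2 Gram system (Z Z^T) beta = Z w.
g11, g12, g22 = dot(Z[0], Z[0]), dot(Z[0], Z[1]), dot(Z[1], Z[1])
r1, r2 = dot(Z[0], w), dot(Z[1], w)
det = g11 * g22 - g12 * g12
beta = [(g22 * r1 - g12 * r2) / det, (g11 * r2 - g12 * r1) / det]
w_from_beta = [beta[0] * a + beta[1] * b for a, b in zip(Z[0], Z[1])]
```

Only the regularizer acts on the orthogonal component, so each step shrinks it by a constant factor while the error term never touches it.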
Kernel Logistic Regression

solving L2-regularized logistic regression

$$\min_{\mathbf{w}} \ \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N}\log\left(1 + \exp(-y_n\mathbf{w}^T\mathbf{z}_n)\right)$$

yields optimal solution $\mathbf{w}_* = \sum_{n=1}^{N}\beta_n\mathbf{z}_n$

without loss of generality, can solve for optimal $\boldsymbol{\beta}$ instead of $\mathbf{w}$

$$\min_{\boldsymbol{\beta}} \ \frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N}\beta_n\beta_m K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{N}\sum_{n=1}^{N}\log\left(1 + \exp\left(-y_n\sum_{m=1}^{N}\beta_m K(\mathbf{x}_m, \mathbf{x}_n)\right)\right)$$

kernel logistic regression:
use representer theorem for kernel trick
on L2-regularized logistic regression
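Since the problem in β is unconstrained, it can be attacked with plain gradient descent. The following is a minimal sketch under assumed settings (Gaussian kernel, XOR-style toy data, hand-picked λ, step size, and iteration count); the function names are hypothetical.

```python
import math


def rbf(x1, x2, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x1 - x2||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))


def theta(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_klr(X, y, kernel, lam=0.01, eta=1.0, steps=2000):
    """Gradient descent on
    (lam/N) sum_n sum_m beta_n beta_m K_nm + (1/N) sum_n log(1 + exp(-y_n s_n)),
    where s_n = sum_m beta_m K(x_m, x_n)."""
    N = len(X)
    K = [[kernel(X[n], X[m]) for m in range(N)] for n in range(N)]
    beta = [0.0] * N
    for _ in range(steps):
        s = [sum(beta[m] * K[m][n] for m in range(N)) for n in range(N)]
        # gradient = (1/N) K (2 lam beta - y * theta(-y s)), using symmetry of K
        v = [2.0 * lam * beta[n] - y[n] * theta(-y[n] * s[n]) for n in range(N)]
        grad = [sum(K[m][n] * v[n] for n in range(N)) / N for m in range(N)]
        beta = [b - eta * g for b, g in zip(beta, grad)]
    return beta


def klr_score(x, X, beta, kernel):
    """sum_m beta_m K(x_m, x); theta of this score is the soft output."""
    return sum(b * kernel(x_m, x) for b, x_m in zip(beta, X))


# XOR-style toy data: not linearly separable in the original space.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [-1, -1, +1, +1]

beta = train_klr(X, y, rbf)
train_scores = [klr_score(x_n, X, beta, rbf) for x_n in X]
```

On this toy set every training example ends up with a non-zero β_n, in contrast to the sparse α_n of SVM.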





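Kernel Logistic Regression (KLR): Another View

$$\min_{\boldsymbol{\beta}} \ \frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N}\beta_n\beta_m K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{N}\sum_{n=1}^{N}\log\left(1 + \exp\left(-y_n\sum_{m=1}^{N}\beta_m K(\mathbf{x}_m, \mathbf{x}_n)\right)\right)$$

• $\sum_{m=1}^{N}\beta_m K(\mathbf{x}_m, \mathbf{x}_n)$: inner product between variables $\boldsymbol{\beta}$ and
  transformed data $(K(\mathbf{x}_1, \mathbf{x}_n), K(\mathbf{x}_2, \mathbf{x}_n), \ldots, K(\mathbf{x}_N, \mathbf{x}_n))$
• $\sum_{n=1}^{N}\sum_{m=1}^{N}\beta_n\beta_m K(\mathbf{x}_n, \mathbf{x}_m)$: a special regularizer $\boldsymbol{\beta}^T\mathrm{K}\boldsymbol{\beta}$
• KLR = linear model of $\boldsymbol{\beta}$
  with kernel as transform & kernel regularizer;
  = linear model of $\mathbf{w}$
  with embedded-in-kernel transform & L2 regularizer
• similar for SVM

warning: unlike coefficients $\alpha_n$ in SVM,
coefficients $\beta_n$ in KLR often non-zero!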
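The equivalence of the two views can be verified numerically with a linear kernel, where w can be written out explicitly. The sketch below uses made-up vectors z_n and coefficients β_n and checks that the kernel regularizer equals the L2 regularizer and that both views give the same scores.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical transformed inputs z_n and coefficients beta_n.
Z = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
beta = [0.3, -1.2, 0.7]
N, d = len(Z), len(Z[0])

# Linear kernel matrix K_nm = z_n^T z_m.
K = [[dot(Z[n], Z[m]) for m in range(N)] for n in range(N)]

# View 1: linear model of w with w = sum_n beta_n z_n and L2 regularizer w^T w.
w = [sum(beta[n] * Z[n][i] for n in range(N)) for i in range(d)]
l2_regularizer = dot(w, w)
scores_w = [dot(w, Z[n]) for n in range(N)]

# View 2: linear model of beta on the N-dimensional transformed data
# (K(x_1, x_n), ..., K(x_N, x_n)), with kernel regularizer beta^T K beta.
kernel_regularizer = sum(beta[n] * beta[m] * K[n][m] for n in range(N) for m in range(N))
scores_beta = [dot(beta, K[n]) for n in range(N)]
```

With a non-linear kernel w may live in a space too large to write down, but the β view keeps working because it only needs the N-by-N kernel matrix.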


Fun Time

When viewing KLR as linear model of $\boldsymbol{\beta}$ with embedded-in-kernel
transform & kernel regularizer, what is the dimension of the $\mathcal{Z}$ space
that the linear model operates on?

1. $d$, the dimension of the original $\mathcal{X}$ space
2. $N$, the number of training examples
3. $\tilde{d}$, the dimension of some feature transform $\boldsymbol{\Phi}(\mathbf{x})$ that is
   embedded within the kernel
4. $\lambda$, the regularization parameter

Reference Answer: (2)

For any $\mathbf{x}$, the transformed data is
$(K(\mathbf{x}_1, \mathbf{x}), K(\mathbf{x}_2, \mathbf{x}), \ldots, K(\mathbf{x}_N, \mathbf{x}))$, which is
$N$-dimensional.
Summary
1. Embedding Numerous Features: Kernel Models

Lecture 5: Kernel Logistic Regression

• Soft-Margin SVM as Regularized Model
  L2-regularization with hinge error measure
• SVM versus Logistic Regression
  ≈ L2-regularized logistic regression
• SVM for Soft Binary Classification
  common approach: two-level learning
• Kernel Logistic Regression
  representer theorem on L2-regularized LogReg

• next: kernel models for regression
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models





